Skills: Describing variables
This page illustrates some skills that will be useful to you as you learn to describe variables in a dataset
For the R examples, I’ll be using the following packages and load a
dataset called my_data.
library(tidyverse)
library(knitr)
library(here)
my_data <- here("datasets",
"test.csv") |>
read_csv() Descriptive statistics
You may want to describe your data with descriptive statistics.
Mean
The mean is the average value of a variable. It is a measure of central tendency that is sensitive to outliers.
R
The function in R that will calculate a mean is called
mean().
## [1] 110382.4
Excel
In Excel, the function is AVERAGE().
Standard deviation
The standard deviation is a measure of how spread out the data are. It is sensitive to outliers.
R
In R, the function sd() will calculate a standard
deviation.
## [1] 71810.03
Excel
In Excel, the function to calculate a standard deviation is
STDEV.P().
Median and other percentiles
Percentiles are the values that a specified percentage of observations of a variable are lower than. The 50th percentile is also called the median: It’s the value that half of all observations are less than.
The median is measure of central tendency that is not sensitive to outliers.
The range between lower and upper percentiles can be a useful measure of central tendency. For example, the interquartile range (IQR) is the range between the 25th and 75th percentile. Half of all values fall within this range.
R
In R, you can calculate a percentile with the function
quantile(). The median is the 50th percentile.
## 50%
## 93083.57
You can also calculate the 25th and 75th percentiles.
## 25% 75%
## 62351.48 137676.30
And the difference between them is the interquartile range.
## [1] 75324.81
Excel
In Excel, the function to calculate percentiles is
PERCENTILE.EXC().
Proportions
Averages and standard deviations don’t make sense for categorical variables. The appropriate descriptive statistics would be proportions.
R
In R, you can use the table() function to get the number
of observations in each category.
##
## Duplex Single-family Three-plus
## 2353 5305 2342
And then you can express these as proportions by dividing those numbers by the number of observations.
##
## Duplex Single-family Three-plus
## 0.2353 0.5305 0.2342
Excel
If you’ve set up a column for each category, where a value of 1 means the observation is in that category and a value of zero means it is not, the average of one of those columns is the proportion of observations within that category.
You can use an IF() function in Excel to convert
categories to zeros and ones.
Confidence intervals
In general, the data sets you work with represent samples of larger populations. Your goal is to be able to generalize from your sample to the population.
Sometimes this is literally and obviously true. For example, you might have sent a survey to one percent of all Canadian households, and you want to analyze the data to make some claims about the characteristics of Canadian households in general.
Other times, describing your sample in terms of the population it comes from is not as obvious. For example, you might have data about every county in the United States suggesting that incomes are higher in urban counties than in rural counties. You might argue that this is not a sample, because you have data on every county that exists. However, if you want to claim that this difference is based on a fundamental difference that is likely to always exist between urban and rural counties in the United States, then you need to be able to generalize this difference to a broader population, which would be all hypothetical counties that could ever exist in the United States.
Confidence intervals for averages
A confidence interval for an average is a range of values that you can have a specified level of confidence that the the average value for the full population falls within. A one-sample t-test is a method for calculating a confidence interval.
R
Here is how you would do a one-sample t-test in R.
##
## One Sample t-test
##
## data: my_data$`sales-price`
## t = 153.71, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 108974.7 111790.0
## sample estimates:
## mean of x
## 110382.4
By default, the t.test function returns a 95-percent
confidence interval, but you can also specify other confidence levels.
Here is how you would get the 99-percent confidence level.
##
## One Sample t-test
##
## data: my_data$`sales-price`
## t = 153.71, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 108532.3 112232.4
## sample estimates:
## mean of x
## 110382.4
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard deviation times the z-score corresponding to the confidence level you want.
So, first you calculate the standard deviation of the sample.
Then you calculate the z-score for the desired confidence level using
the function NORM.INV.S(). The probability you should use
for a 95% confidence level is 0.975. Or, if you use a probability of
0.025, you’ll get the opposite (i.e. negative) of the z-score you
need.
Now you can calculate the upper and lower bounds of the confidence interval.
Confidence intervals for proportions
Confidence intervals for proportions work the same way as confidence intervals for averages, but you use the standard error of the proportion instead of the standard deviation.
R
In R, you use prop.test rather than t.test when you want to find the confidence interval for a proportion.
##
## 1-sample proportions test with continuity correction
##
## data: sum(my_data$building == "Single-family") out of nrow(my_data), null probability 0.5
## X-squared = 37.088, df = 1, p-value = 1.129e-09
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5206585 0.5403179
## sample estimates:
## p
## 0.5305
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error times the z-score corresponding to the confidence level you want.
The standard error is given by the equation:
\[ S.E. = \sqrt{\frac{p\times(1-p)}{n}} \]
Where p is the proportion and n is the number of observations.
Then you calculate the z-score and the upper and lower bounds of the confidence interval as you would for a continuous variable.
Visualizing data
You might find it useful to visualize a variable in your dataset using a histogram, and one-dimensional scatterplot, or a bar chart.
Histogram
A histogram is a useful way to visualize the distribution of a continuous variable.
R
In R (ggplot2), geom_histogram will generate a
histogram.
Here, I’ll visualize the distribution of a variable in my dataset called “store size.”
Excel
You can create a histogram in Excel using the data analysis tool pack. By default, it will put gaps between the bars. It is more common to show histograms without these gaps, so you can remove them.
One-dimesional scatterplot
A one-dimentional scatteplot can be a much more literal representation of the the distribution of a dataset.
R
In R, you can use geom_jitter() to generate a scattplot.
You can set the x-coordinate as the variable you want to visualize, and
set the y-coordinate as a constant, and R will randomly vary the y
positions.
You might want to turn off the y-axis labels since they’ll be meaningless.
You might also want to make the points a little bit transparent to avoid overplotting.
ggplot(my_data) +
geom_jitter(aes(x = `sales-price`, y = 0),
alpha = 0.1) +
theme_linedraw() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())Excel
In Excel, to acheive that same effect, you need to create a fake y-variable that varies randomly.
Then you can create a scatterplot.
Bar charts
You might use a bar chart to represent proportions.
R
In R, you can use geom_bar() to represent proportions
with a stacked bar chart.
ggplot(my_data) +
geom_bar(aes(fill = building, x = 1),
position = "fill") +
scale_y_continuous(name = "Proportion",
breaks = c(0.2,
0.4,
0.6,
0.8,
1),
labels = c("20%",
"40%",
"60%",
"80%",
"100%")) +
theme_linedraw() +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())ggplot(my_data) +
geom_bar(aes(x = building)) +
scale_y_continuous(breaks = breaks <- c(2000, 4000, 6000),
labels = paste0(100*breaks / nrow(my_data), "%"),
name = "Percentage") +
theme_linedraw() You can also calculate the proportions and their confidence intervals and make a little table for them like this.
prop_summary <- my_data |>
mutate(single_family = building == "Single-family",
duplex = building == "Duplex",
three_plus = building == "Three-plus") |>
select(single_family,
duplex,
three_plus) |>
pivot_longer(cols = everything(),
names_to = "which_building",
values_to = "is_type") |>
group_by(which_building) |>
summarise(Proportion = mean(is_type),
`C.I. Low` = prop.test(x = sum(is_type==1),
n = n(),
conf.level = 0.95)$conf.int[1],
`C.I. High` = prop.test(x = sum(is_type==1),
n = n(),
conf.level = 0.95)$conf.int[2])
prop_summary## # A tibble: 3 × 4
## which_building Proportion `C.I. Low` `C.I. High`
## <chr> <dbl> <dbl> <dbl>
## 1 duplex 0.235 0.227 0.244
## 2 single_family 0.530 0.521 0.540
## 3 three_plus 0.234 0.226 0.243
And then we can use that to make a bar chart with error bars
ggplot(prop_summary,
aes(x = which_building,
y = Proportion)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = `C.I. Low`,
ymax = `C.I. High`)) +
theme_linedraw()Excel
In Excel, if you set up a little table with the proportions and margins of error like this:
You can create a bar chart like this:
And add error bars like this:
A little shortcut
You can download a couple of helpful R functions here: https://github.com/c-voulgaris/GSD-SES-5215-F24/blob/main/sample%20assignments/descriptive%20statistics/helpers.R
Save it to your project folder and add this line of code to the top
of your script (or RMarkdown or Quarto file), right after where you load
all your libraries. Use the here() function to indicate
where in your project folder you have saved the helpers.R
script
Now you have two functions you can use:
descriptive_histogram will generate a histogram with the
labels for descriptive statistics. You need to specify the variable,
labels for the x an y axes, and the number of decimal places you want to
include on all numbers.
distance_histogram <- descriptive_histogram(my_data$city_dist,
label_x = "city_dist",
label_y = "Number of home sales",
decimal_places = 2)
distance_histogramYou can optionally display the histogram with a log-transformed x-axis.
distance_histogram <- descriptive_histogram(my_data$city_dist,
label_x = "Distance from city (miles)",
label_y = "Number of home sales",
decimal_places = 2,
is_log = TRUE)
distance_histogramdescriptive_cat_bars will give you a bar chart with
error bars and labels.
type_plot <- descriptive_cat_bars(my_data$building,
x_label = "Type of building",
y_label = "Proportion of homes")
type_plot